Daily Trend [12-04]
【1】Think before you speak: Training Language Models With Pause Tokens
【URL】http://arxiv.org/abs/2310.02226
【Time】2023-10-03
1. Research Area
NLP; delayed inference / extra inference-time computation
2. Motivation
In the current LM framework, generating the (K+1)-th token always operates on exactly the K tokens seen so far, and no more. The authors instead want to spend more than K operations on producing the next token, so they propose synthetically increasing the input sequence length by appending M dummy tokens to the input, thereby delaying the model's next response by M input tokens.
In the current paradigm of language models, we compute exactly K embeddings v_1, ..., v_K in each layer, before generating the (K+1)-th token, p_{K+1}. Our premise is that this limit of K operations is an arbitrary one. Instead, we wish to expend more than K operations towards producing the next token, p_{K+1}. While something to this effect could be achieved by increasing the number of attention heads in each layer, we are interested in an orthogonal approach that introduces hardly any parameters into the network. The idea is to synthetically increase the input sequence length by appending M dummy tokens to the input, thus delaying the model’s next response by M tokens of input. In effect, this M-token delay lets the model manipulate an additional set of M intermediate vectors v_{K+1}, ..., v_{K+M} before committing to its next (output) token, p_{K+1}. Intuitively, these vectors could provide a richer representation of the input (e.g., by attending differently), thus resulting in a better next token from the model.
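To make the delay concrete, here is a minimal sketch of the append-M-pauses idea at inference time with a Hugging Face-style causal LM. The model name ("gpt2"), the "<pause>" string, and M = 10 are illustrative assumptions rather than the paper's setup; the authors train their own decoder-only models with the pause token present from pretraining onward.

```python
# Minimal sketch of pause-inference (assumed setup, not the authors' code):
# append M copies of a "<pause>" token to the prompt so the model gets M extra
# positions of computation before it must commit to the answer token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper trains its own decoder-only LMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register "<pause>" as a new (dummy) token; its embedding row is essentially
# the only new parameter introduced.
tokenizer.add_special_tokens({"additional_special_tokens": ["<pause>"]})
model.resize_token_embeddings(len(tokenizer))

M = 10  # number of pause tokens appended at inference time (assumed value)
prompt = "Q: What is 17 + 25?\nA:"
padded_prompt = prompt + "<pause>" * M  # delay the answer by M input tokens

inputs = tokenizer(padded_prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=8)

# Everything up to and including the last <pause> is input; only tokens
# generated after it are treated as the model's answer.
answer = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:])
print(answer)
```

The point of the sketch is only the bookkeeping: the prompt is padded with M pauses, and the answer is read off strictly after the last pause.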
3. Method and Techniques
(1) In pretraining, "<pause>" tokens are inserted at random positions in the pretraining sequences, and the loss on predicting the pause tokens themselves is ignored (see the data-preparation sketch at the end of this section).
Figure 2: Standard vs. pause-pretraining.
(2) In finetuning, M "<pause>" tokens are appended after the prefix (the question/input), and the loss is computed only on the target tokens that follow the last pause (also shown in the sketch at the end of this section).
Figure 1: Standard vs. pause-inference (and finetuning).
(3) At inference, M "<pause>" tokens are likewise appended to the input, and the model's outputs are ignored until the last pause token has been seen; only what follows is taken as the answer.
(4) The authors run a grid of experiments over standard/"pause" (method) × pretraining/finetuning (stage), along with ablations on the number of "<pause>" tokens, and summarize the takeaway as follows:
Pause-training takes a step beyond the paradigm of “immediate” next-token prediction in language models. The key idea is to train models with (dummy) <pause> tokens so that the model can learn to harness the additional inference-time computation. We demonstrate that this can improve performance on a variety of our tasks, if we train with <pause> tokens both during pretraining and downstream finetuning.
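As a companion to the inference sketch in section 2, the sketch below shows one way the "<pause>" token could enter the two training stages, using the common convention that labels are aligned with input_ids and positions labelled -100 are excluded from the loss. PAUSE_ID, the function names, and the exact insertion scheme are assumptions for illustration, not the authors' released code.

```python
# Sketch (assumed details) of how <pause> tokens could enter the two training
# stages. Convention: labels are aligned with input_ids, the causal shift is
# handled inside the model's loss, and IGNORE_INDEX positions are skipped.
import random
from typing import List, Tuple

IGNORE_INDEX = -100   # label value that the cross-entropy loss ignores
PAUSE_ID = 50257      # hypothetical vocabulary id of the added "<pause>" token


def pause_pretraining_example(token_ids: List[int],
                              num_pauses: int) -> Tuple[List[int], List[int]]:
    """Insert <pause> at random positions; never train the model to predict a pause."""
    ids = list(token_ids)
    for _ in range(num_pauses):
        ids.insert(random.randint(0, len(ids)), PAUSE_ID)
    labels = [IGNORE_INDEX if t == PAUSE_ID else t for t in ids]
    return ids, labels


def pause_finetuning_example(prefix_ids: List[int],
                             target_ids: List[int],
                             m: int) -> Tuple[List[int], List[int]]:
    """Append M pauses after the prefix; compute the loss only on the target."""
    ids = list(prefix_ids) + [PAUSE_ID] * m + list(target_ids)
    labels = [IGNORE_INDEX] * (len(prefix_ids) + m) + list(target_ids)
    return ids, labels


if __name__ == "__main__":
    ids, labels = pause_finetuning_example([101, 102, 103], [7, 8, 9], m=4)
    print(ids)     # [101, 102, 103] + [PAUSE_ID] * 4 + [7, 8, 9]
    print(labels)  # [-100] * 7 + [7, 8, 9]
```

The design choice mirrored here is that the model is never trained to emit a pause, only to condition on it.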
4. Summary
It tells much the same story, and the method and model are very similar, but it is not written as compellingly as Vision Transformers Need Registers; moreover, it keeps giving this reader the impression of "dressing up something that barely works into a paper that looks rather good."
5. Recommended Related Reading
Vision Transformers Need Registers